CancerGPT: Few-shot Drug Pair Synergy Prediction using Large Pre-trained Language Models

Large pre-trained language models (LLMs) have been shown to have significant potential in few-shot learning across various fields, even with minimal training data. However, their ability to generalize to unseen tasks in more complex fields, such as biology, has yet to be fully evaluated. LLMs can offer a promising alternative approach for biological inference, particularly in cases where structured data and sample size are limited, by extracting prior knowledge from text corpora. Our proposed few-shot learning approach uses LLMs to predict the synergy of drug pairs in rare tissues that lack structured data and features. Our experiments, which involved seven rare tissues from different cancer types, demonstrated that the LLM-based prediction model achieved significant accuracy with very few or zero samples. Our proposed model, CancerGPT (with ~124M parameters), was even comparable to the larger fine-tuned GPT-3 model (with ~175B parameters). Our research is the first to tackle drug pair synergy prediction in rare tissues with limited data. We are also the first to utilize an LLM-based prediction model for biological reaction prediction tasks.


Introduction
Foundation models have become the latest generation of artificial intelligence (AI) (Moor et al. (2023)). Instead of designing AI models that solve specific tasks one at a time, such foundation or "generalist" models can be applied to many downstream tasks without task-specific training. For example, large pre-trained language models (LLMs), such as GPT-3 (Brown et al. (2020)) and GPT-4 (OpenAI (2023)), have been a game changer among foundation AI models (Mitchell and Krakauer (2023)). An LLM can apply its skills to unfamiliar tasks that it has never been trained for, which is known as few-shot or zero-shot learning. This is due in part to multitask learning, which enables an LLM to unintentionally gain knowledge from implicit tasks in its training corpus (Radford et al. (2018)). Although LLMs have shown proficiency in few-shot learning in various fields (Brown et al. (2020)), including natural language processing, robotics, and computer vision (Veit et al. (2017); Brown et al. (2020); Wertheimer and Hariharan (2019)), their generalizability to unseen tasks in more complex fields such as biology has yet to be fully tested. Inferring unseen biological reactions requires knowledge of the participating entities (e.g., genes, cells) and the underlying biological mechanisms (e.g., pathways, genetic background, cellular environment). While structured databases encode only a small portion of this knowledge, the vast majority is stored in free-text literature, which could be used to train LLMs. Thus, we envision that, when structured data and sample sizes are limited, LLMs can serve as an innovative approach for biological prediction tasks by extracting prior knowledge from unstructured literature. One such few-shot biological prediction task with a pressing need is drug pair synergy prediction in understudied cancer types.
Drug combination therapy has become a widely accepted strategy for treating complex diseases such as cancer, infectious diseases, and neurological disorders. In many cases, combination therapy can provide better treatment outcomes than single-drug therapy. Predicting drug pair synergy has become an important area of research in drug discovery and development. Drug pair synergy refers to the enhancement of the therapeutic effects of two (or more) drugs when used together compared to when each drug is used alone. The prediction of drug pair synergy can be challenging due to the large number of possible combinations and the complexity of the underlying biological mechanisms (Zagidullin et al. (2019)). Several computational methods have been developed to predict drug pair synergy, particularly using machine learning. Machine learning algorithms can be trained on large datasets of in vitro experimental results for drug pairs to identify patterns and predict the likelihood of synergy for a new drug pair. However, most of the available data come from common cancer types in certain tissues, such as breast and lung cancer; very limited experimental data are available for other tissues, such as bone and soft tissues (Fig. 1). Obtaining cell lines from these tissues can be physically difficult and expensive, which limits the amount of training data available for drug pair synergy prediction. This makes it challenging to train machine learning models that rely on large datasets.
Early studies in this area have relied on relational or contextual information to extrapolate the synergy score to cell lines in other tissues (Chen and Li (2018); Sun et al. (2020); Kuru et al. (2022)), ignoring the biological and cellular differences between these tissues. Another line of studies has sought to overcome the discrepancy between tissues by utilizing diverse and high-dimensional features, including genomic (e.g., gene expression of cell lines) or chemical profiles (e.g., drug structure) (Preuer et al. (2018); Liu and Xie (2021); Kuru et al. (2022); Hosseini and Zhou (2023); Kim et al. (2021)). Despite promising results in some tissues (those with abundant data), these approaches cannot be applied to tissues whose data are too limited to fit a model with the large number of parameters that such high-dimensional features require.
In this work, we aim to overcome the above challenge using LLMs. We hypothesize that cancer types with limited structured data and discrepant features are still well described in the scientific literature. Manually extracting predictive information on such biological entities from the literature is a complex task. Our innovative approach is to leverage the prior knowledge from scientific literature that is encoded in LLMs. We built a few-shot drug pair synergy prediction model that transforms the prediction task into a natural language inference task and generates answers based on the prior knowledge encoded in LLMs. Our experimental results demonstrate that our LLM-based few-shot prediction model achieved significant accuracy even in the zero-shot setting (i.e., no training data) and outperformed strong tabular prediction models in most cases.
This remarkable few-shot prediction performance in one of the most challenging biological prediction tasks has critical and timely implications for the broad biomedical community because it shows strong promise for "generalist" biomedical artificial intelligence (Moor et al. (2023)).

Drug pair synergy prediction
Many methods have been proposed to predict drug pair synergy in recent years. Based on the type of data they use, these methods can be classified as either multi-way relational methods or context-aware methods. Multi-way relational methods (Chen and Li (2018); Sun et al. (2020)) take the relational information between drugs and cell lines as input, without any further chemical or gene information, and predict the drug pair's synergy. Context-aware methods (Preuer et al. (2018); Liu and Xie (2021); Kuru et al. (2022); Hosseini and Zhou (2023)) additionally utilize chemical and gene information from drugs and cell lines, which usually covers drug-drug, drug-gene, and gene-gene interactions and the cellular environment. These methods usually achieve good performance with rich features on common tissues. However, neither approach applies to cell lines in rare tissues, where data and cellular information are limited. Kim et al. (2021) use transfer learning to extend a prediction model trained on common tissues to some rare tissues with relatively rich data and cellular features. However, this approach cannot be used for rare tissues with extremely limited data and cellular information.
Figure 1: Few-shot prediction in biology. A. Unlike task-specific approaches, a large pre-trained language model can perform new tasks for which it has not been explicitly trained. B. Drug pair synergy prediction in rare tissues is an important example of the numerous few-shot prediction tasks in biology. C. A large pre-trained language model can be an innovative approach for few-shot prediction in biology thanks to the prior knowledge encoded in its weights.

Few-shot learning on tabular data
Traditional supervised learning algorithms can struggle due to the difficulty in obtaining enough labeled data for classification. Few-shot learning is an emerging field that aims to address this issue by enabling machines to learn from a few examples rather than requiring a large amount of labeled data. Meta-learning (Finn et al. (2017); Wang et al. (2023); Gao et al. (2023)) is one technique for few-shot learning. It trains a model on a set of tasks in a way that allows it to quickly learn to solve new, unseen tasks with a few examples. Another technique is data augmentation (Nam et al. (2023); Yang et al. (2022)), which generates new examples by transforming existing data. One promising but less explored direction is to leverage LLMs, particularly when prior knowledge encoded in a corpus of text can serve as a predictive feature. TabLLM (Hegselmann et al. (2023)) is one such framework. It serializes the tabular input into natural language text and prompts the LLM to generate predictions. Leveraging TabLLM, we investigated the effectiveness of LLMs in few-shot learning tasks in biology.

Language models for biomolecular sequence analysis
There has been a growing interest in using language models for biomolecular sequence analysis, and one approach involves the training of language models with biomolecular data (Madani et al. (2023); NVIDIA (2023)). These models learn the language of biomolecules, such as DNA, RNA, and protein sequences, similar to how GPT-2 (Radford et al. (2018)) or GPT-3 (Brown et al. (2020)) learns human language. However, our study takes a different approach. Rather than training a language model specifically for biomolecular data, we use a language model that has been pre-trained on a corpus of human language text. This pretrained model is used as a few-shot prediction model for drug pair synergy data, allowing us to make accurate predictions with minimal training data. By leveraging the power of pre-trained language models, we are able to make use of existing resources and obtain generalizability to diverse biological prediction tasks beyond biomolecule sequence analysis.

Results
We developed CancerGPT, a few-shot drug pair synergy prediction model for rare tissues. Leveraging an LLM-based tabular data prediction model (Hegselmann et al. (2023)), we first converted the prediction task into a natural language inference task and generated answers using prior knowledge from the scientific literature encoded in the LLM's pre-trained weight matrices (Section 5.3, Fig. 2). We present our strategy for adapting the LLM to our task with only a few shots of training data in each rare tissue in Section 5.5 and Fig. 3.
To evaluate the performance of our proposed CancerGPT model and other LLM-based models, we conducted a series of experiments in which we compared them with various other tabular models (Section 6). We measured accuracy using the area under the precision-recall curve (AUPRC) and the area under the receiver operating characteristic curve (AUROC) under different settings. We considered different few-shot learning scenarios, in which the model is provided with a limited number k of training samples to learn from (k=0 to 128). By varying the number of shots, we can examine the model's ability to adapt and generalize with minimal training data. Next, we investigated the performance of CancerGPT and other LLM-based models across different tissue types. Since cancer is a highly heterogeneous disease with various subtypes, it is crucial for the model to be able to accurately predict outcomes in diverse tissue contexts. We then investigated whether the LLM's reasoning for its prediction is valid by checking its argument against the scientific literature.
Figure 2: Study workflow. We first converted the tabular input to natural text and created a task-specific prompt (Section 5.2). The prompt was designed to generate binary class predictions (e.g., "Positive", "Not positive"). We fine-tuned the LLMs (GPT-2 and GPT-3) with k shots of data in rare tissues (Section 5.5). We further tailored GPT-2 by fine-tuning it with a large amount of common tissue data, in order to adjust GPT-2 to the context of drug pair synergy prediction (CancerGPT, Section 5.4). We evaluated and compared the prediction models with different numbers of shots and tissues (Section 6). We investigated the LLM's reasoning based on factual evidence. Example input shown in the figure: "Drug combination and cell line: The first drug is AZD1775. The second drug is AZACITIDINE. The cell line is EW-8. Tissue is bone. The first drug's sensitivity using relative inhibition is 25.687. The second drug's sensitivity using relative inhibition is 1.752. Synergy:"
Table 1: AUPRC of k-shot learning on seven tissue sets, by method and number of shots (k = 0, 2, 4, 8, 16, 32, 64, 128). $n_0$ := total number of non-synergistic samples (not positive), $n_1$ := total number of synergistic samples (positive). We used 20% of the data as a test set in each rare tissue, while ensuring the binary labels were equally represented.
Table 2: AUROC of k-shot learning on seven tissue sets.
We evaluated the accuracy of our synergy prediction models. We calculated the AUPRC and AUROC of the LLM-based models (CancerGPT, GPT-2, GPT-3) and baseline models (XGBoost, TabTransformer) (Tables 1 and 2). Due to an imbalance in positive and non-positive labels, we reported both AUPRC and AUROC. Details on the classification task and threshold of synergy are discussed in Section 6.3.
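To make the two reported metrics concrete, the following is a minimal sketch in plain Python. The function names are hypothetical, and the paper does not state which implementation was used. AUROC is computed here through its pairwise-ranking interpretation, and AUPRC is approximated by average precision, which is one common estimator of the area under the precision-recall curve.

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive.

    Samples with tied scores keep their input order (stable sort).
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive sample")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


# Toy labels and scores, made up for illustration only.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
print(auroc(toy_labels, toy_scores))              # 0.75
print(average_precision(toy_labels, toy_scores))
```

With imbalanced labels, average precision depends on the positive rate while AUROC does not, which is one reason to report both.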

Number of shots
Number of training data and accuracy. Overall, the LLM-based models (CancerGPT, GPT-2, GPT-3) achieved comparable or better accuracy in most of the cases compared to baselines. In the zero-shot scenario, the LLM-based models generally had higher accuracy than the baseline models in all experiments except stomach and bone. As the number of shots increased, we observed mixed patterns across various tissues and models. TabTransformer consistently exhibited an increase in accuracy with more shots. CancerGPT showed higher accuracy with more shots in the endometrium and soft tissue, and GPT-3 showed higher accuracy with more shots in the liver, soft tissues, and bone, indicating that the information gained from a few shots of data complements the prior knowledge encoded in CancerGPT and GPT-3.
However, the LLM-based models sometimes did not show significant improvements in accuracy in certain tissues, such as the stomach and urinary tract, suggesting that the additional training data do not always improve the LLM-based models' performance. With the maximum number of shots (k=128), the LLM-based model, specifically GPT-3, was on par with TabTransformer, achieving the highest accuracy with the pancreas, liver, soft tissue, and bone, while TabTransformer achieved the best accuracy with endometrium, stomach, and urinary tract.
Tissue types and accuracy. The accuracy of the models varied depending on the tissue type, as each tissue possessed unique characteristics and had a different data size. In pancreas and endometrium tissues, GPT-3 showed high accuracy with only a few shots (k=0 or 2). Generally, cell lines from these two tissues are difficult to obtain, and few are well established, which makes them less investigated. For example, the pancreas is located deep within the abdomen, making it difficult to access and isolate cells without damaging them. The endometrium is a complex tissue that undergoes cyclic changes during the menstrual cycle, and this dynamic process complicates the cell culturing process. Because of this limited training data, few-shot drug pair synergy prediction in these tissues required even higher generalizability.
In the liver, soft tissue, and bone, GPT-3 again achieved higher accuracy than any other model, including those trained with common tissues (TabTransformer, CancerGPT). This may be because these tissues have unique cellular characteristics specific to their tissue of origin, so training with common tissues may not help predict their synergy accurately. For example, hepatic cell lines (originating from liver tissue) are often used in research on drug metabolism and toxicity and have unique drug response characteristics due to high expression of drug-metabolizing enzymes such as cytochrome P450s (Guo et al. (2011)). Bone cell lines have bone-specific signaling pathways that can affect drug responses, and the extracellular matrix composition and structure in bone tissue can also impact drug delivery and efficacy (Lin et al. (2020)).
On the other hand, models trained with common tissues (TabTransformer, CancerGPT) achieved the best accuracy in the stomach and urinary tract tissues for all k, indicating that the prediction learned from common tissues can be extrapolated to these tissues. In particular, CancerGPT achieved the highest accuracy with no training sample (k=0) in the stomach.
Comparing LLM-based models. When comparing LLM-based models, CancerGPT and GPT-3 demonstrated superior accuracy compared to GPT-2 in most tissues. GPT-3 exhibited higher accuracy than CancerGPT in tissues with limited data or unique characteristics, while CancerGPT performed better than GPT-3 in tissues with less distinctive characteristics, such as the stomach and urinary tract. The higher accuracy of CancerGPT compared to GPT-2 highlights that well-balanced adjustment to specific tasks can increase accuracy while maintaining generalizability. However, the benefits of such adjustments may diminish with larger LLMs, such as GPT-3 (175B parameters), in situations where more generalizability is required. The fact that CancerGPT, with far fewer parameters (124M), achieved accuracy comparable to GPT-3 (175B parameters) implies that further fine-tuning of GPT-3 could achieve even higher accuracy.

Fact check LLM's reasoning
We evaluated whether the LLM can provide the biological reasoning behind its prediction. In this experiment, we used zero-shot GPT-3 because the fine-tuned LLM-based models lost much of their language generation ability during fine-tuning and were not able to provide coherent responses. We randomly selected one true positive prediction and examined whether its biological rationale was based on factual evidence or mere hallucination. Our example was the drug pair AZD4877 and AZD1208 in cell line T24 for urinary tract tissue. We prompted the LLM with "Could you provide details why are the drug1 and drug2 synergistic in the cell line for a given cancer type?". Details on prompt generation are discussed in Supplementary 1. We evaluated the generated answer by comparing it with existing scientific literature. We found that the LLM provided mostly accurate arguments, except for two cases (Table 3) for which no supporting scientific literature exists. By combining these individual scientific facts, the LLM inferred the unseen synergistic effect. Generally, drugs targeting non-overlapping proteins in similar pathways are more likely to be synergistic (Cheng et al. (2019); Tang and Gottlieb (2022)). In this case, both AZD4877 and AZD1208 target similar pathways that inhibit tumor cell division without overlapping protein targets. The Loewe synergy score of this pair in T24 was 46.82, indicating a strong positive synergistic effect.

Example of prediction results
As an example, we listed predicted synergistic drug pairs for stomach and soft tissue using CancerGPT (Table S3.1, S3.2) and for bone and liver tissue using GPT-3 (Table S3.3, S3.4). We randomly selected two true positive, false positive, true negative, and false negative prediction examples. We discovered that the Loewe synergy scores of the true negative or false negative prediction examples were close to the threshold we used to categorize the label (i.e., Loewe score >5). This suggests that accuracy may vary significantly with the threshold used to determine positive synergy. Setting more extreme thresholds (e.g., >10, >30), as in previous models (Kim et al. (2021), Kuru et al. (2022), Hosseini and Zhou (2023)), may increase the prediction accuracy.
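The sensitivity of the labels to the threshold can be illustrated with a minimal sketch. The function name is hypothetical. The score 46.82 is the Loewe score reported above for the AZD4877 and AZD1208 pair; the other two scores are made up to show a borderline case.

```python
def synergy_label(loewe_score, threshold=5.0):
    """Return 1 (positive synergy) if the Loewe score exceeds the threshold, else 0."""
    return 1 if loewe_score > threshold else 0


# 46.82 comes from the text; 6.0 and 4.9 are hypothetical borderline scores.
scores = [46.82, 6.0, 4.9]
for threshold in (5.0, 10.0, 30.0):
    labels = [synergy_label(s, threshold) for s in scores]
    print(f"threshold > {threshold}: {labels}")
```

A pair scoring just above 5 is labeled positive under the threshold used here but not under the stricter thresholds, so samples near the cut-off are the ones most likely to be misclassified.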

Discussion
Summary. Our study investigates the potential of LLMs as a widely applicable few-shot prediction model in the field of biology. Specifically, we propose a new few-shot model for predicting drug pair synergy, which can be used in rare tissues with few or no training samples available. We transformed tabular data prediction into natural language inference tasks and fine-tuned LLMs (GPT-2, GPT-3) with very few samples in each tissue. The CancerGPT model, which was further tuned with a large amount of common tissue data, showed comparable accuracy to the few-shot tuned GPT-3 model, indicating that tailoring GPT-3 to specific tasks could further improve prediction accuracy. The LLM's reasoning for its prediction revealed that it implicitly infers unseen synergistic effects by combining several independent scientific facts.
Table 3: Example of generated answer when the LLM was asked to provide its reasoning for its prediction.
Excerpt of the generated answer | Fact check and reference
"The combination of AZD-4877 and AZD1208 has been studied in T24 cells...to be synergistic in reducing bladder cancer cell growth and metastasis" | False. No study has been conducted on this drug pair.
"The combination was also found to target multiple pathways involved in the growth and spread of bladder cancer cells." | True. AZD1208 is a PIM1 inhibitor. PIM1 is overexpressed in bladder cancer initiation and progression (Guo et al. (2010)). AZD4877 is a drug designed to target bladder cancer (Jones et al. (2013)).
"...Specifically, AZD-4877 was found to inhibit the activation of proteins involved in the promotion of tumor growth..." | True. AZD4877 is a drug designed to target bladder cancer (Jones et al. (2013)).
"...AZD1208 was found to inhibit proteins associated with the inhibition of tumor growth." | True. AZD1208 inhibits cell growth by suppressing p70S6K, 4EBP1 phosphorylation, and messenger RNA translation (in acute myeloid leukemia) (Cortes et al. (2018)).
"This combination was also effective at reducing the production of inflammatory mediators such as cytokines, which are known to contribute to tumor progression." | False. AZD1208 is a pan-PIM kinase inhibitor, and PIM kinases are downstream effectors of cytokines (National Cancer Institute (2011)). However, there is no evidence that AZD4877 reduces inflammatory mediators.
"...these two drugs have been shown to reduce levels of apoptosis inhibitors, which can also play a role in tumor progression." | True. AZD1208 induces cell apoptosis (Cervantes-Gomez et al. (2019)). AZD4877 is an inhibitor of Eg5, which promotes cell apoptosis (Borthakur et al. (2009)).
Why drug pair synergy prediction to evaluate LLMs. The prediction of drug pair synergy in uncommon tissues serves as an excellent benchmark task for evaluating LLMs in few-shot learning within the field of biology. This prediction requires incorporating multiple pieces of information, such as the drugs and cell line, as well as the sensitivity of the cell line to each drug, in order to infer the synergistic effect. While detailed information on these entities can be found in scientific papers, the interaction effect, or synergistic effect, is primarily available through biological experiments. To effectively assess LLMs' inference capabilities, one must employ a prediction task where the ground truth is not explicitly available in text format but can be determined through alternative sources for model evaluation. Typically, drug pair synergy scores are obtained through high-throughput testing facilities involving robot arms (He et al. (2018)). Therefore, individual experimental records are seldom reported in academic literature, decreasing the likelihood of their use as training data for LLMs. Additionally, few studies have been conducted on synergy prediction models for rare tissues, and their synergy prediction outcomes are not explicitly stated in text format. Another similar task is predicting the sensitivity of a single drug in a cell line; however, since the sensitivity of individual drugs is extensively researched and well-documented in publications, the LLM may merely recollect the answer from the text rather than infer it for an unseen task.
Comparison to existing drug pair synergy prediction models. It should be noted that it was not possible to compare our LLM-based models with previous drug pair synergy prediction models. The majority of these models require high-dimensional features of drugs and cells (e.g., genomic or chemical profiles), along with a substantial amount of training data, including the one specifically designed for rare tissues (Kim et al. (2021)). This kind of data is not easily accessible in rare tissues, which makes a meaningful comparison difficult. Our model is designed to address a common but often overlooked situation where we have limited features and data. Thus, we compared the LLM-based models with other tabular models that share the same set of inputs.
Contribution. The contributions of our study can be summarized as follows. In the area of drug pair synergy prediction in rare tissues, our study is the first to predict drug pair synergy on tissues with very limited data and features, which previous prediction models have neglected. This breakthrough in drug pair synergy prediction could have significant implications for drug development in these cancer types. By accurately predicting which drug pairs will have a synergistic effect on tissues whose cell lines are expensive to obtain, biologists can focus directly on the most probable drug pairs and perform in vitro experiments in a cost-effective manner. Our study also delivers generalizable insights about LLMs in the broader context of biology. To the best of our knowledge, our study is the first to investigate the use of LLMs as a few-shot inference tool based on prior knowledge in the field of biology, where much of the latest information is presented in unstructured free text (such as scientific literature). This innovative approach could have significant implications for advancing computational biology where abundant training data are not readily obtainable. By leveraging the vast amounts of unstructured data available in the field, LLMs can help researchers bypass the challenge of limited training data when building data-driven computational models.
Furthermore, this LLM-based few-shot prediction approach could be applied to a wide range of diseases beyond cancer, which is currently limited by the scarcity of available data. For instance, this approach could be used in infectious diseases, where the prompt identification of new treatments and diagnostic tools is crucial. LLMs could help researchers quickly identify potential drug targets and biomarkers for these diseases, resulting in faster and more effective treatment development.
Limitations. The present study, while aiming to showcase the potential of LLMs as a few-shot prediction model in the field of biology, is not without its limitations. To fully establish the generalizability of LLMs as a "generalist" artificial intelligence, a wider range of biological prediction tasks must be undertaken to validate it. Additionally, it is crucial to investigate how the information gleaned from LLMs complements the existing genomic or chemical features that have traditionally been the primary source of predictive information. In future research, we plan to delve deeper into this aspect and develop an ensemble method that effectively utilizes both existing structured features and new prior knowledge encoded in LLMs.
Furthermore, while we observed that GPT-3's reasoning was similar to our own when fact-checking its argument with scientific literature in one example, it is important to note that the accuracy of its arguments cannot always be verified and may be susceptible to hallucination. It is reported that LLMs can also contain biases that humans have (Schramowski et al. (2022)). Therefore, further research is necessary to ensure that the LLM's reasoning is grounded in factual evidence. Despite these limitations, our study provides valuable insights into the potential of LLMs as a few-shot prediction model in biology and lays the groundwork for future research in this area.

Problem Formulation
Objective. Our objective is to predict whether a drug pair in a certain cell line has a synergistic effect, particularly focusing on rare tissues with limited training samples. Given an input $x = \{d_1, d_2, c, t, ri_1, ri_2\}$ consisting of a drug pair $(d_1, d_2)$, cell line $c$, tissue $t$, and the sensitivities $ri_1$ and $ri_2$ of the two drugs measured by relative inhibition, the prediction model is $f(x) = y$, where $y \in \{0, 1\}$ is the binary synergy class (1 if positive synergy; 0 otherwise). Prior research (Hosseini and Zhou (2023)) has employed three different scenarios for predicting drug pair synergy (random split, stratified by cell lines, stratified by drug combinations). Our task is to predict synergy when the data are stratified by tissue, where each tissue comprises a subset of the cell lines.
Why tabular input. As discussed in Section 2, relationships learned in one tissue do not generalize well to other tissues that have different cellular environments. This biological difference poses a challenge in predicting drug pair synergy in tissues with a limited number of samples. The limited sample size makes it even more difficult to incorporate typical cell line features, such as gene expression levels, which have large dimensionality (e.g., ~20,000 genes). Given this data challenge, the drug pair synergy prediction task reduces to building a prediction model from limited samples (few- or zero-shot learning) using only a limited set of tabular input feature types. The specific input features are described in Section 6.
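As an illustration of this formulation, the sketch below represents one sample as a small record and separates samples by tissue, which is the stratification used in this task. The class name, function name, and all sample values are hypothetical placeholders rather than the authors' data or code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SynergySample:
    drug1: str       # d_1
    drug2: str       # d_2
    cell_line: str   # c
    tissue: str      # t
    ri1: float       # relative inhibition of drug 1 in the cell line
    ri2: float       # relative inhibition of drug 2 in the cell line
    label: int       # y: 1 if positive synergy, 0 otherwise


def split_by_tissue(samples, rare_tissues):
    """Return (common, rare), where rare maps each rare tissue to its samples."""
    common, rare = [], defaultdict(list)
    for sample in samples:
        if sample.tissue in rare_tissues:
            rare[sample.tissue].append(sample)
        else:
            common.append(sample)
    return common, dict(rare)


# Placeholder samples, made up for illustration only.
samples = [
    SynergySample("drugA", "drugB", "cellX", "breast", 10.0, 20.0, 1),
    SynergySample("drugA", "drugC", "cellY", "bone", 5.0, 30.0, 0),
    SynergySample("drugB", "drugC", "cellZ", "liver", 12.5, 3.0, 1),
]
common, rare = split_by_tissue(samples, rare_tissues={"bone", "liver"})
print(len(common), sorted(rare))
```

Splitting by tissue rather than by cell line or drug combination means that a rare tissue's cell lines never appear in the common-tissue training data.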

Synergy prediction models based on Large pre-trained language models
Converting tabular input to natural text. To use an LLM for tabular data, the tabular input and prediction task must be transformed into natural text. For each instance of tabular data (Fig. 2), we converted the structured features into text. For example, given the feature names (e.g., "drug1", "drug2", "cell line", "tissue", "sensitivity1", "sensitivity2") and their values (e.g., "lonidamine", "717906-29-1", "A-673", "bone", "0.568", "28.871"), we converted the instance to "The first drug is lonidamine. The second drug is 717906-29-1. The cell line is A-673. Tissue is bone. The first drug's sensitivity using relative inhibition is 0.568. The second drug's sensitivity using relative inhibition is 28.871." Alternative ways to convert a tabular instance into natural text are discussed in previous papers (Li et al. (2020); Narayan et al. (2022)).
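The conversion described above can be sketched as a simple template function. The function name and the dictionary layout are assumptions made for illustration; the sentence template and the feature names and values follow the example in the text.

```python
def serialize_row(row):
    """Turn one tabular instance (a dict of feature name -> value) into text."""
    return (
        f"The first drug is {row['drug1']}. "
        f"The second drug is {row['drug2']}. "
        f"The cell line is {row['cell line']}. "
        f"Tissue is {row['tissue']}. "
        f"The first drug's sensitivity using relative inhibition is {row['sensitivity1']}. "
        f"The second drug's sensitivity using relative inhibition is {row['sensitivity2']}."
    )


# Feature values taken from the example in the text; kept as strings so that
# the numbers are reproduced exactly as given.
row = {
    "drug1": "lonidamine",
    "drug2": "717906-29-1",
    "cell line": "A-673",
    "tissue": "bone",
    "sensitivity1": "0.568",
    "sensitivity2": "28.871",
}
print(serialize_row(row))
```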

Converting prediction task into natural text
We created a prompt that specifies our task and guides the LLM to generate a label of interest. We experimented with multiple prompts. One example of the prompts we created was "Determine cancer drug combination synergy for the following drugs. Allowed synergies: Positive, Not positive. {{ Tabular Input }}. Synergy:". As our task is a binary classification, we designed the prompt to generate only binary answers ("Positive", "Not positive"). After comparing these multiple prompts (Supplementary 1), the final prompt we used in this work was "Decide in a single word if the synergy of the drug combination in the cell line is positive or not. {{ Tabular Input }}. Synergy:".
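A minimal sketch of how the final prompt could be assembled and how a generated answer could be mapped back to a binary label is shown below. The prompt wording is taken from the text; the helper names and the answer-parsing rule are assumptions made for illustration, not the authors' implementation.

```python
PROMPT_TEMPLATE = (
    "Decide in a single word if the synergy of the drug combination "
    "in the cell line is positive or not. {tabular_input}. Synergy:"
)


def build_prompt(tabular_input):
    """Insert the serialized tabular input into the final prompt.

    A trailing period on the input is dropped so the template does not
    produce a doubled period.
    """
    return PROMPT_TEMPLATE.format(tabular_input=tabular_input.rstrip(". "))


def parse_answer(generated):
    """Map generated text to 1 ("Positive"), 0 ("Not positive"), or None."""
    text = generated.strip().lower()
    if text.startswith("not positive"):
        return 0
    if text.startswith("positive"):
        return 1
    return None  # the model produced something outside the allowed answers


prompt = build_prompt("The cell line is A-673. Tissue is bone.")
print(prompt)
print(parse_answer(" Not positive"))
```

Checking "not positive" before "positive" matters because the second string is a suffix of the first; returning None for anything else makes unexpected generations visible instead of silently counting them as one class.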

LLM-based prediction model
Large pre-trained language models We built our prediction models by tuning GPT-2 and GPT-3 into our tasks (Fig. 2). GPT-2 is a Transformer-based large language model which was pre-trained on a very large corpus of English data without human supervision. It achieved state-of-the-art results on several language modeling datasets in a zero-shot setting when it was released, and it is the predecessor of GPT-3 and GPT-4. GPT-2 (Radford et al. (2018)) has several versions with different sizes of parameters, GPT-2, GPT-Medium, GPT-Large, and GPT-XL. We used GPT-2 with the smallest number of parameters (regular GPT-2, 124 million) in this work to make the model trainable on our server. To adjust the model for a binary classification task, we added a linear layer as a sequence classification head on top of GPT-2, which uses the last token of the output of GPT-2 to classify the input. The cross-entropy loss was used to optimize the model during the fine-tuning process (discussed below).
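The classification head and loss described above can be sketched in plain Python. This is only an illustration of the computation: toy two-dimensional vectors stand in for the GPT-2 output, the weights are made up, and the function names are hypothetical. In practice the hidden states come from the Transformer and the head is trained jointly with it.

```python
import math


def last_token_logits(hidden_states, weight, bias):
    """Apply a linear layer to the hidden state of the last token.

    hidden_states: list of per-token vectors for one sequence
    weight: one row of coefficients per class; bias: one value per class
    """
    h_last = hidden_states[-1]
    return [
        sum(w * h for w, h in zip(row, h_last)) + b
        for row, b in zip(weight, bias)
    ]


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Toy stand-ins for the Transformer output (3 tokens, 2 dimensions each).
hidden = [[0.1, 0.2], [0.5, -0.3], [1.0, 2.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # class 0 = "Not positive", class 1 = "Positive"
bias = [0.0, 0.0]

logits = last_token_logits(hidden, weight, bias)
print(logits)                    # [1.0, 2.0]
print(cross_entropy(logits, 1))  # loss when the true class is "Positive"
```

Only the last position is used because, in an autoregressive model, it is the only token whose representation has attended to the whole prompt.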
GPT-3 (Brown et al. (2020)) is a Transformer-based autoregressive language model with 175 billion parameters, which achieved state-of-the-art performance on many zero-shot and few-shot tasks when it was released. GPT-3.5, from which the well-known ChatGPT (OpenAI (2022)) was fine-tuned, is an improved version of GPT-3. Although the weights of the GPT-3 model are not publicly available, OpenAI offers an API (OpenAI (2021)) to fine-tune the model and evaluate its performance. We used this API to build drug pair synergy prediction models through k-shot fine-tuning. OpenAI provides four models for fine-tuning: Davinci, Curie, Babbage, and Ada. Of these, Ada is the fastest and performs comparably to the larger models on classification tasks, so we used GPT-3 Ada as our classification model. After the training data were uploaded, the API set the learning rate to 0.05, 0.1, or 0.2 times the original learning rate, depending on the size of the data, and fine-tuned the model for four epochs. The model from the last epoch was provided for further evaluation.

CancerGPT
We further tailored GPT-2 by fine-tuning it with a large amount of common tissue data, in order to adapt GPT-2 to the context of drug pair synergy prediction. We named this model CancerGPT. CancerGPT used the same structure as the modified GPT-2 described above: a linear layer was added on top of GPT-2, which uses the last token of the GPT-2 output to predict the label. To use the pre-trained GPT-2 model, we used the same tokenizer as GPT-2. Left padding was used to ensure that the last token came from the prompt sentence rather than from padding. The cross-entropy loss was used to optimize the model.
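The role of left padding can be illustrated with token-id lists. This is a minimal sketch with made-up token ids and a hypothetical padding id; the helper `left_pad` is not part of any library.

```python
def left_pad(batch, pad_id):
    """Pad token-id sequences on the left to a common length.

    Returns the padded sequences and attention masks (1 = real token, 0 = padding).
    """
    max_len = max(len(seq) for seq in batch)
    padded, masks = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        padded.append([pad_id] * n_pad + list(seq))
        masks.append([0] * n_pad + [1] * len(seq))
    return padded, masks


PAD_ID = 0  # hypothetical padding id, chosen for illustration
batch = [[11, 12, 13, 14], [21, 22]]  # made-up token ids for two prompts

padded, masks = left_pad(batch, PAD_ID)
# With left padding, the final position of every row holds a real prompt token.
last_tokens = [seq[-1] for seq in padded]
print(padded, last_tokens)
```

With right padding, the final position of the shorter sequence would instead hold a padding token, and a head that reads the last position would classify from padding rather than from the prompt.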
CancerGPT was first fine-tuned to learn the relational information between drug pairs from common tissues, similar to collaborative filtering (Suphavilai et al. (2018)) (Fig. 3). This approach was based on the assumption that certain drug pairs exhibit synergy regardless of the cellular context, and therefore the relational information between drug pairs in common tissues can be used to predict synergy in new cell lines in different tissues (Hosseini and Zhou (2023)). Additionally, we incorporated information on the sensitivity of the given cell line to each individual drug, using the relative inhibition score as the measure of sensitivity. This gave the model a more detailed view of the relationship between drugs and cell lines.
Subsequently, we used CancerGPT as one of the pre-trained LLMs and fine-tuned it on k shots of data in each rare tissue (as discussed in the following section). All the LLM-based models use the tabular input converted to natural text and share the same prompt.

k-shot fine-tuning strategy
The LLM-based models had different training and fine-tuning strategies (Fig. 3). For CancerGPT, samples from common tissues were split into 80% training data and 20% validation data. The models were trained on the training data and evaluated on the validation data to select the hyperparameters to be used for further fine-tuning on rare tissues. For the GPT-2 and GPT-3 based prediction models, we directly used the pre-trained parameters of GPT-2 (Radford et al. (2018)), loaded with Hugging Face's Transformers library (Wolf et al. (2020)), and of GPT-3 Ada from OpenAI (Brown et al. (2020)), respectively.
Figure 3: Training strategy of baseline and proposed LLM-based models. General tabular models and CancerGPT were first trained with samples from common tissues then k-shot fine-tuned with each tissue of interest. GPT-2 and GPT-3 are pre-trained models, and we fine-tuned them with k shots of data in each tissue.
All these models were then fine-tuned with k shots of data in each of the rare tissues. For bone, urinary tract, stomach, soft tissues, and liver, we performed experiments with k from [0,2,4,8,16,32,64,128]. Because of the limited amount of data, we ran experiments with k from [0,2,4,8,16,32] for the endometrium and only zero-shot (k = 0) for the pancreas.
With the limited number of shots, a careful balance of binary labels in the training and test sets was critical. We partitioned the data into 80% for training and 20% for testing in each rare tissue, while ensuring the binary labels were equally represented in both sets. We randomly selected k shots from the training set for fine-tuning, keeping the previously selected shots and adding new ones. Specifically, we kept the previously selected k shots and added k more to create 2 × k shots. The binary label distribution in each k-shot set followed that of the original data, with at least one positive and one negative sample included in each set. For evaluation stability, the test data were kept the same across different shots for each tissue.
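One way to realize this sampling scheme is sketched below. It is an assumption about the mechanics, not the original code: the function names are hypothetical, the labels are made up, and nesting is obtained by taking growing prefixes of a fixed random ordering of each class. The sketch also assumes the training set holds enough samples of each class for the largest k.

```python
import random


def stratified_split(labels, train_frac=0.8, seed=0):
    """Split indices into train/test so each class keeps the same share in both sets."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(round(train_frac * len(idx)))
        train += idx[:cut]
        test += idx[cut:]
    return train, test


def nested_k_shots(train_idx, labels, ks, seed=0):
    """Pick k-shot subsets of train_idx so that each smaller set is contained in the larger ones."""
    rng = random.Random(seed)
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    pos_rate = len(pos) / len(train_idx)
    shots = {}
    for k in sorted(ks):
        if k == 0:
            shots[k] = []  # zero-shot: no fine-tuning samples
            continue
        # Follow the original label distribution, with at least one sample per class.
        n_pos = min(max(1, round(k * pos_rate)), k - 1)
        # Prefixes of a fixed ordering guarantee that earlier shots are retained.
        shots[k] = pos[:n_pos] + neg[:k - n_pos]
    return shots


# Made-up labels: 300 samples, 30% positive.
labels = [1] * 90 + [0] * 210
train_idx, test_idx = stratified_split(labels)
shots = nested_k_shots(train_idx, labels, ks=[0, 2, 4, 8, 16, 32, 64, 128])
print({k: len(v) for k, v in shots.items()})
```

Because the test indices are fixed once by the split, every value of k is evaluated on the same held-out samples.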

Dataset
We used a large, publicly accessible database of drug synergy from the DrugComb Portal (Zagidullin et al. (2019)), an open-access data portal where the results of drug combination screening studies for a large variety of cancer cell lines are accumulated, standardized, and harmonized. The database contains both drug sensitivity rows and drug pair synergy rows. After filtering the available drug pair synergy rows, the data contain 4,226 unique drugs and 288 cell lines, with a total of 718,002 drug pair synergy rows. We employed the Loewe synergy score (Greco et al. (1995)), which ranges from -100 (antagonistic effect) to 75 (strong synergistic effect), as the measure of drug combination synergy. The Loewe synergy score quantifies the excess over the expected response if the two drugs are the same compound (Ianevski et al. (2017); Yadav et al. (2015)). In this paper, we focused on cell lines from rare tissues. We defined rare tissues as those with fewer than 4,000 samples, which include the pancreas (n=39), endometrium (n=68), liver (n=213), soft tissues (n=352), stomach (n=1,190), urinary tract (n=2,458), and bone (n=3,985). We tested our models on each of the rare tissues.

Baseline models
We compared the LLM-based prediction models with two tabular models that take the same set of inputs: XGBoost (Chen and Guestrin (2016)) and TabTransformer (Huang et al. (2020)). XGBoost is a gradient-boosting algorithm for supervised learning on structured or tabular data, based on tree ensembles. It is widely used on large-scale drug synergy data (Sidorov et al. (2019); Celebi et al. (2019)).
TabTransformer is a self-attention-based supervised learning model for tabular data. TabTransformer applies a sequence of multi-head attention-based Transformer layers to parametric embeddings to transform them into contextual embeddings, in which highly correlated features are close to each other in the embedding space. Considering the highly correlated nature of drugs in our data, TabTransformer can be a very strong baseline in this work.
To train the two baseline models, we first converted the drugs and cell lines in the tabular data into indicators using one-hot encoding. Tissue information was not used in training because the models are tested on one specific rare tissue that is not used in training. Neither XGBoost nor TabTransformer is a pre-trained LLM; thus, no further contextual information can be inferred from the unseen tissue indicator. For XGBoost, all the variables (drugs, cell lines, and sensitivities) were used as input to predict the drug pair synergy. For TabTransformer, we first trained an embedding layer from scratch on the categorical variables (drugs and cell lines) and passed the embeddings through stacked multi-head attention layers, whose output we then combined with the continuous variables (sensitivities). This combination then passes through feed-forward layers, which end in a classification head.
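The one-hot step for the baseline inputs can be sketched as follows. This is a minimal plain-Python illustration; the function name and column keys are hypothetical, the second row's sensitivities are made up, and the tissue column is deliberately left out, as described above.

```python
def one_hot_encode(rows, categorical, continuous):
    """One-hot encode the categorical columns and append the continuous columns unchanged."""
    # Vocabulary of each categorical column, in a fixed (sorted) order.
    vocab = {col: sorted({r[col] for r in rows}) for col in categorical}
    names = [f"{col}={v}" for col in categorical for v in vocab[col]] + list(continuous)
    matrix = []
    for r in rows:
        vec = [1.0 if r[col] == v else 0.0 for col in categorical for v in vocab[col]]
        vec += [float(r[col]) for col in continuous]
        matrix.append(vec)
    return names, matrix


# Toy rows; the tissue column is omitted because it is not used by the baselines.
rows = [
    {"drug1": "lonidamine", "drug2": "717906-29-1", "cell line": "A-673",
     "sensitivity1": 0.568, "sensitivity2": 28.871},
    {"drug1": "AZD1775", "drug2": "AZACITIDINE", "cell line": "SF-295",
     "sensitivity1": 1.25, "sensitivity2": 10.0},  # made-up sensitivities
]
names, matrix = one_hot_encode(
    rows,
    categorical=["drug1", "drug2", "cell line"],
    continuous=["sensitivity1", "sensitivity2"],
)
print(names)
print(matrix[0])
```

A vector of this form corresponds to the XGBoost input described above; TabTransformer instead learns embeddings for the categorical variables, so it would consume category indices rather than these indicators.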

Hyperparameter Setting
The predicted output was a binary label indicating the presence of a synergistic effect, with a Loewe score greater than 5 indicating a positive result. We used AUROC and AUPRC to evaluate classification performance. Regression was not feasible with our LLM-based models, which generate text-based answers ("positive" or "not positive") and quantify the synergy value with poor precision.
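The labeling rule and the two metrics can be illustrated in plain Python. The scores below are made up, the function names are hypothetical, and average precision is used here as one common estimator of AUPRC; the text does not specify which estimator was used.

```python
def binarize(loewe_scores, threshold=5.0):
    """Label a drug pair positive (1) when its Loewe score exceeds the threshold."""
    return [1 if s > threshold else 0 for s in loewe_scores]


def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits  # requires at least one positive sample


loewe = [12.3, -4.0, 5.0, 30.1, 0.7]  # made-up Loewe scores
labels = binarize(loewe)              # a score of exactly 5 is not positive
scores = [0.9, 0.2, 0.6, 0.4, 0.1]    # made-up predicted probabilities of "Positive"

print(labels, auroc(labels, scores), average_precision(labels, scores))
```

Both metrics depend only on the ranking of the predicted scores, so they can be computed without choosing a decision threshold on the model output.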
XGBoost was used with a boosting learning rate of 0.3. The number of gradient-boosted trees was set to 1000, with a maximum tree depth of 20 for the base learners. TabTransformer was used with a learning rate of 0.0001 and a weight decay of 0.01. The model was trained for 50 epochs on common tissues. During training, the model with the best validation performance was selected for further fine-tuning on rare tissues. For each k shot in each tissue, the model was fine-tuned using the same learning rate and weight decay for 1 epoch and tested with AUPRC and AUROC. Details of the hyperparameter settings are discussed in Supplementary 2.
CancerGPT was first fine-tuned from the pre-trained regular GPT-2 for 4 epochs on common tissues. The learning rate was set to 5e-5 and the weight decay to 0.01. The model was then fine-tuned on k shots in rare tissues with the same hyperparameters. Finally, the model was tested with AUPRC and AUROC.
GPT-2 and GPT-3 were fine-tuned directly on rare tissues, starting from the pre-trained parameters of regular GPT-2 and GPT-3 Ada. For each k shot in each tissue, GPT-2 was fine-tuned for 4 epochs using a learning rate of 5e-5 and a weight decay of 0.01. The hyperparameters of GPT-3 were adjusted by the OpenAI API based on the data size, and the model was also fine-tuned for 4 epochs. The fine-tuned GPT-2 and GPT-3 models were finally tested with AUPRC and AUROC.